Performance Comparison of Apache Spark and Tez for Entity Resolution

Authors

Ikram Ul Haq

Eike Schallehn

Xiao Chen

Abstract

Entity Resolution is one of the most active topics in the field of Big Data. It identifies records in datasets that refer to the same real-world entity. Entity Resolution algorithms are computationally intensive and time-consuming, especially on large datasets. Much research has aimed to improve Entity Resolution solutions, and a number of algorithms have been developed to reduce the time needed to run Entity Resolution on a given dataset. Although the efficiency of these algorithms has improved significantly, it is still not adequate for large datasets in the Big Data field. We contribute to improving run time not by changing the algorithm but by finding the most suitable platform on which to run it. This in turn increases efficiency and can indirectly raise the accuracy of Entity Resolution by making it feasible to run more computationally intensive algorithms. We shortlisted Apache Spark (RDD, DataFrame, and Dataset) and Apache Tez (Hive) as the candidate platforms. In this work we chose the Blocking technique to implement Entity Resolution in these four applications. We performed a number of experiments with different configurations to find the most efficient platform by analyzing, comparing, and evaluating the results in detail.
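
The Blocking technique named in the abstract groups records by a cheap key so that only records sharing a key are compared pairwise, rather than every record against every other. The following is a minimal, platform-independent sketch of that idea in plain Python. It is not the authors' implementation and does not use Spark or Tez; the blocking key (first three letters of the name), the sample records, the similarity measure, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lower-cased name.
    return record["name"].strip().lower()[:3]


def build_blocks(records):
    # Group records so that only records with the same key share a block.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def similarity(a, b):
    # Character-level similarity in [0, 1]; an assumed stand-in for
    # whatever matching function a real pipeline would use.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def resolve(records, threshold=0.8):
    # Compare pairs only within each block and keep likely duplicates.
    matches = []
    for block in build_blocks(records).values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


# Illustrative sample data (invented for this sketch).
records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Maria Garcia"},
    {"id": 4, "name": "Marie Curie"},
]

if __name__ == "__main__":
    print(resolve(records))
```

With these four sample records the key produces two blocks of two records each, so only 2 of the 6 possible pairs are compared. On a distributed platform the same grouping step would typically correspond to a group-by on the blocking key, with the pairwise comparison running inside each group.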

Related resources

Don't cry over spilled records: Memory elasticity of data-parallel applications and its application to cluster scheduling

Understanding the performance of data-parallel workloads when resource-constrained has significant practical importance but unfortunately has received only limited attention. This paper identifies, quantifies and demonstrates memory elasticity, an intrinsic property of data-parallel tasks. Memory elasticity allows tasks to run with significantly less memory than they would ideally want while onl...

Roaring bitmaps: Implementation of an optimized software library

Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix’s Atlas, LinkedIn’s Pivot, Metamarkets’ Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Servic...

Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments

BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases —queries— which require a broad combination of data extraction techniques including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning to fulfill them. However, currently, there is no widespread knowledge of the different resource require...

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

In-Stream Big Data Processing

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that realtime query processing and in-stream processing is the immediate need in many practical applications. In recent years, this idea got a lot of traction and a whole bunch of solutions like Twitter’s Storm, Yahoo’s S4, Cloudera’s Impala, A...

Publication date: 2017
